Class Material

[Machine Learning 2022] The Magical Self-supervised Learning Models for Speech and Images

Self-supervised learning has so far been used mainly in NLP.

BERT-style models can also be applied to speech, for example to recognize text from audio.

SUPERB (superbbenchmark.org)

[2105.01051] SUPERB: Speech processing Universal PERformance Benchmark (arxiv.org)
[2203.06849] SUPERB-SG: Enhanced Speech processing Universal PERformance Benchmark for Semantic and Generative Capabilities (arxiv.org)

Applying self-supervised learning to images:

Image Recognition
Object Detection
Semantic Segmentation
Visual Navigation

[2110.09327] Self-Supervised Representation Learning: Introduction, Advances and Challenges (arxiv.org)

Self-supervised learning sometimes outperforms supervised learning.

Generative Approaches

BERT handles NLP with masking; the same approach can be applied to speech.

[1910.12638] Mockingjay: Unsupervised Speech Representation Learning with Deep Bidirectional Transformer Encoders (arxiv.org)

BERT-style:

[1910.12638] Mockingjay: Unsupervised Speech Representation Learning with Deep Bidirectional Transformer Encoders (arxiv.org) Masks a span of the speech signal for training (the span must not be too short, otherwise it is too easy to guess)
[2007.06028] TERA: Self-Supervised Learning of Transformer Encoder Representation for Speech (arxiv.org) Masks certain feature dimensions instead of a span of speech
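The two masking strategies can be sketched as follows. This is a minimal illustration, not the papers' exact recipes: the feature shape (100 frames by 80 dimensions), the zero-fill masking, and the helper names `mask_time_span` and `mask_channels` are all assumptions made for the example.

```python
import numpy as np

def mask_time_span(features, start, width):
    """Zero out `width` consecutive frames (rows), i.e. a span of speech."""
    masked = features.copy()
    masked[start:start + width, :] = 0.0
    return masked

def mask_channels(features, start, width):
    """Zero out `width` consecutive feature dimensions (columns) in every frame."""
    masked = features.copy()
    masked[:, start:start + width] = 0.0
    return masked

rng = np.random.default_rng(0)
feats = rng.normal(size=(100, 80))  # assumed shape: 100 frames x 80 feature dims

time_masked = mask_time_span(feats, start=20, width=7)   # span masking
chan_masked = mask_channels(feats, start=10, width=8)    # dimension masking

# The model would then be trained to reconstruct `feats` from the masked input.
```

A wider time span forces the model to rely on longer context instead of interpolating between neighbouring frames.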

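GPT-style:

Predict the upcoming part of the speech sequence (the prediction horizon must not be too short, otherwise the task is too easy).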

[1910.12607] Generative Pre-Training for Speech with Autoregressive Predictive Coding (arxiv.org)
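Predicting several frames ahead amounts to shifting the target sequence. The sketch below only builds the (input, target) pairs; the helper name `apc_pairs` and the toy one-dimensional features are assumptions for illustration.

```python
import numpy as np

def apc_pairs(features, n_ahead):
    """Pair each frame t with the frame n_ahead steps later as its target."""
    if n_ahead < 1 or n_ahead >= len(features):
        raise ValueError("n_ahead must be between 1 and len(features) - 1")
    inputs = features[:-n_ahead]
    targets = features[n_ahead:]
    return inputs, targets

# Toy features: frame t has the single value t, so the shift is easy to see.
feats = np.arange(10, dtype=float).reshape(10, 1)
inputs, targets = apc_pairs(feats, n_ahead=3)
```

With `n_ahead=1` the target is almost identical to the input for smooth speech features, which is why a larger shift makes a more useful task.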

Predictive Approach

Train the model without generating anything.

Given a rotated image, have the model predict the rotation angle.

[1803.07728] Unsupervised Representation Learning by Predicting Image Rotations (arxiv.org)
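The training data for the rotation task needs no human labels, because the label is the rotation that was applied. A minimal sketch, assuming square images and the four rotations 0, 90, 180, and 270 degrees; `make_rotation_batch` is a hypothetical helper name.

```python
import numpy as np

def make_rotation_batch(image):
    """Return the 4 rotated copies of a square image and labels 0..3.

    Label k means the image was rotated by k * 90 degrees counterclockwise.
    """
    rotated = [np.rot90(image, k) for k in range(4)]
    labels = np.arange(4)
    return np.stack(rotated), labels

image = np.arange(16).reshape(4, 4)
batch, labels = make_rotation_batch(image)
# A classifier would be trained to predict `labels` from `batch`.
```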

Given two image patches, have the model predict their relative position.

[1505.05192] Unsupervised Visual Representation Learning by Context Prediction (arxiv.org)
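One way to picture this task: cut a 3x3 grid of patches, take the centre patch, and ask which of the 8 neighbours the second patch is. The sketch below is a simplification (no gaps or jitter between patches); `NEIGHBOUR_OFFSETS` and `sample_patch_pair` are names made up for the example.

```python
import numpy as np

# (row, col) offsets of the 8 neighbours around the centre cell, in reading order.
NEIGHBOUR_OFFSETS = [(-1, -1), (-1, 0), (-1, 1),
                     (0, -1),           (0, 1),
                     (1, -1),  (1, 0),  (1, 1)]

def sample_patch_pair(image, patch, label):
    """Return the centre patch and the neighbour patch identified by `label`."""
    d_row, d_col = NEIGHBOUR_OFFSETS[label]
    centre = image[patch:2 * patch, patch:2 * patch]
    row0 = (1 + d_row) * patch
    col0 = (1 + d_col) * patch
    neighbour = image[row0:row0 + patch, col0:col0 + patch]
    return centre, neighbour

image = np.arange(36).reshape(6, 6)   # toy 6x6 "image", patch size 2
centre, top_left = sample_patch_pair(image, patch=2, label=0)
_, bottom_right = sample_patch_pair(image, patch=2, label=7)
# The model sees (centre, neighbour) and is trained to predict `label`.
```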

Predict the temporal relationship between two speech segments.

Pre-Training Audio Representations With Self-Supervision (IEEE Xplore)

Have the model directly predict the cluster assignments produced by clustering.

Speech

Image

[1807.05520] Deep Clustering for Unsupervised Learning of Visual Features (arxiv.org)

Contrastive Learning

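Given several images, the model should still recognize the original image after augmentations have been applied (the output vectors for the same cat should be as close as possible, while the vectors for a cat and a dog should be as far apart as possible).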

[2002.05709] A Simple Framework for Contrastive Learning of Visual Representations (arxiv.org)
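The "pull the same image together, push different images apart" objective can be written as a softmax over cosine similarities. The following is a NumPy sketch of an NT-Xent-style loss; the temperature of 0.5 and the toy one-hot embeddings are assumptions for illustration, and no gradients are computed here.

```python
import numpy as np

def nt_xent_loss(z1, z2, temperature=0.5):
    """z1[i] and z2[i] are embeddings of two augmented views of image i."""
    n = len(z1)
    z = np.concatenate([z1, z2], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)    # cosine similarity
    sim = z @ z.T / temperature
    np.fill_diagonal(sim, -np.inf)                      # never compare a view with itself
    # Row i's positive is the other view of the same image.
    positives = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])
    # Numerically stable log-softmax over each row.
    row_max = sim.max(axis=1, keepdims=True)
    log_prob = sim - row_max - np.log(np.exp(sim - row_max).sum(axis=1, keepdims=True))
    return float(-log_prob[np.arange(2 * n), positives].mean())

views = np.eye(4)                                       # 4 toy "images"
loss_aligned = nt_xent_loss(views, views)               # positives match
loss_shuffled = nt_xent_loss(views, np.roll(views, 1, axis=0))  # positives mismatched
```

The aligned case gives the lower loss, since each view's most similar candidate is its true positive.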

Speech version:

[2010.13991] Speech SIMCLR: Combining Contrastive and Reconstruction Objective for Self-supervised Speech Representation Learning (arxiv.org)

MoCo: compared with SimCLR, it adds a memory bank (a queue of previously encoded keys) and a momentum encoder.

[1911.05722] Momentum Contrast for Unsupervised Visual Representation Learning (arxiv.org)
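The two additions can be sketched in a few lines. This is an illustration only: parameters are plain NumPy arrays, and the momentum value, the queue size, and the helper names `momentum_update` and `update_queue` are assumptions for the example.

```python
import numpy as np

def momentum_update(key_params, query_params, m=0.999):
    """Move the key (momentum) encoder slowly toward the query encoder."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]

def update_queue(queue, new_keys, max_size):
    """Append the newest keys and drop the oldest so at most max_size remain."""
    queue = np.concatenate([queue, new_keys], axis=0)
    return queue[-max_size:]

# Momentum encoder: only a small fraction of the query weights is mixed in.
query_params = [np.ones(3)]
key_params = [np.zeros(3)]
key_params = momentum_update(key_params, query_params, m=0.9)

# Queue of negatives: keys from earlier batches are reused until evicted.
queue = np.zeros((0, 2))
queue = update_queue(queue, np.ones((3, 2)), max_size=4)
queue = update_queue(queue, 2.0 * np.ones((3, 2)), max_size=4)
```

Because the key encoder changes slowly, keys stored in the queue from earlier batches stay roughly consistent with freshly encoded ones.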

MoCo v2: borrows ideas from SimCLR.

[2003.04297] Improved Baselines with Momentum Contrastive Learning (arxiv.org)

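Part of the Encoder output is fed to a Predictor; the Predictor's output should be as close as possible to the remaining positive examples and as far as possible from the negative examples. In CPC the Predictor is a GRU; in wav2vec it is a CNN.

vq-wav2vec produces discrete rather than continuous outputs (which makes it possible to apply BERT and helps remove noise).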
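For a single prediction, the "close to the positive, far from the negatives" requirement can be expressed as a classification over candidates. A minimal sketch, assuming dot-product scores and toy 2-dimensional vectors; `info_nce` is a name chosen for the example and the GRU/CNN that would produce `prediction` is left out.

```python
import numpy as np

def info_nce(prediction, positive, negatives):
    """Cross-entropy of picking the positive among [positive] + negatives.

    prediction: (d,), positive: (d,), negatives: (k, d).
    """
    candidates = np.vstack([positive[None, :], negatives])
    scores = candidates @ prediction
    scores = scores - scores.max()                 # numerical stability
    log_prob = scores - np.log(np.exp(scores).sum())
    return float(-log_prob[0])                     # index 0 is the positive

prediction = np.array([1.0, 0.0])
loss_good = info_nce(prediction,
                     positive=np.array([1.0, 0.0]),
                     negatives=np.array([[0.0, 1.0], [-1.0, 0.0]]))
loss_bad = info_nce(prediction,
                    positive=np.array([-1.0, 0.0]),
                    negatives=np.array([[0.0, 1.0], [1.0, 0.0]]))
```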

[1910.05453] vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations (arxiv.org)

[1911.03912] Effectiveness of self-supervised pre-training for speech recognition (arxiv.org)

[2006.11477] wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations (arxiv.org)
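Turning continuous frames into discrete units can be pictured as a nearest-neighbour lookup in a codebook. This is a simplified sketch of the idea rather than the training procedure of the papers above; the codebook values and the helper name `quantize` are made up for the example.

```python
import numpy as np

def quantize(frames, codebook):
    """Map each frame to the index (and vector) of its nearest codebook entry."""
    # Squared Euclidean distance between every frame and every code.
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    codes = dists.argmin(axis=1)
    return codes, codebook[codes]

codebook = np.array([[0.0, 0.0], [10.0, 10.0]])      # 2 codes, toy values
frames = np.array([[1.0, -1.0], [9.0, 11.0], [0.2, 0.1]])
codes, quantized = quantize(frames, codebook)
# `codes` is a sequence of discrete tokens, so BERT-style training can be applied to it.
```

Small variations between frames that map to the same code disappear after quantization, which is one way to read the "removes noise" remark.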

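Classification vs. Contrastive

Motivation for contrastive learning: humans learn not only from positive signals but also from having undesirable behavior corrected. Contrastive learning is in fact one paradigm of unsupervised learning.

Choosing negative examples is not easy: they should be neither too hard nor too easy.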

Bootstrapping Approaches

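If the model is simply given no negative examples, it tends to output exactly the same vector for every input, a failure known as collapse. (left)

After a transformation is applied to the image, the vector produced by the Encoder followed by the Predictor is trained to match, as closely as possible, the vector produced by the Encoder alone, and the Encoder is updated. (right)

Typical Knowledge Distillation: a small model (the student) is trained until its outputs match those of a large model (the teacher). (left)

The Encoder with the Predictor attached acts as the student, and after each round it becomes the new teacher. (right)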
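The student/teacher view can be sketched with a matching loss and a teacher update. This is a hedged illustration: the cosine-based loss, the `tau` parameter, and the names `regression_loss` and `update_teacher` are assumptions for the example, and in real training the target branch would receive no gradient.

```python
import numpy as np

def regression_loss(prediction, target):
    """2 - 2 * cosine similarity, averaged over the batch (target is a constant)."""
    p = prediction / np.linalg.norm(prediction, axis=1, keepdims=True)
    t = target / np.linalg.norm(target, axis=1, keepdims=True)
    return float(np.mean(2.0 - 2.0 * np.sum(p * t, axis=1)))

def update_teacher(teacher, student, tau=0.99):
    """Blend the student into the teacher; tau=0 copies the student outright."""
    return tau * teacher + (1.0 - tau) * student

x = np.array([[1.0, 0.0], [0.0, 2.0]])
loss_same = regression_loss(x, x)        # student output matches teacher output
loss_opposite = regression_loss(x, -x)   # student output points the opposite way

teacher = update_teacher(np.zeros(2), np.ones(2), tau=0.0)
```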

Simply Extra Regularization

The objective is split into three parts:

Invariance
Variance
Covariance
- Push the off-diagonal elements of the embeddings' covariance matrix toward 0
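The three parts can be computed on a batch of embeddings as below. This is a sketch in the spirit of VICReg, correct only up to normalization constants; the threshold `gamma`, the `eps` value, and the toy batches are assumptions, and the weights used to combine the three terms are omitted.

```python
import numpy as np

def vicreg_terms(z1, z2, gamma=1.0, eps=1e-4):
    """Return (invariance, variance, covariance) terms for two views' embeddings."""
    n, d = z1.shape

    # Invariance: two views of the same input should have similar embeddings.
    invariance = float(np.mean((z1 - z2) ** 2))

    def variance_term(z):
        # Penalize dimensions whose standard deviation over the batch is below gamma.
        std = np.sqrt(z.var(axis=0, ddof=1) + eps)
        return float(np.mean(np.maximum(0.0, gamma - std)))

    def covariance_term(z):
        # Penalize off-diagonal entries of the covariance matrix.
        centred = z - z.mean(axis=0)
        cov = centred.T @ centred / (n - 1)
        off_diag = cov - np.diag(np.diag(cov))
        return float((off_diag ** 2).sum() / d)

    variance = variance_term(z1) + variance_term(z2)
    covariance = covariance_term(z1) + covariance_term(z2)
    return invariance, variance, covariance

collapsed = np.ones((8, 4))                                   # every output identical
spread = np.array([[2.0, 2.0], [2.0, -2.0], [-2.0, 2.0], [-2.0, -2.0]])
correlated = np.array([[2.0, 2.0], [-2.0, -2.0], [1.0, 1.0], [-1.0, -1.0]])

inv_c, var_c, cov_c = vicreg_terms(collapsed, collapsed)
inv_s, var_s, cov_s = vicreg_terms(spread, spread)
inv_r, var_r, cov_r = vicreg_terms(correlated, correlated)
```

The collapsed batch has zero invariance loss but a large variance penalty, which is how the regularizer rules out collapse without negative examples.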

Concluding Remarks

Approach	Image	Speech
Generative	GPT for image	Mockingjay, APC
Predictive	Rotation Prediction, etc.	HuBERT
Contrastive	SimCLR, MoCo	CPC, Wav2vec series
Bootstrapping	BYOL, SimSiam	Data2vec
Regularization	Barlow Twins, VICReg	DeLoRes